Here the general workflow of sequencing analysis is shown. First we will get the fastq files from the sample sequencing. After that, the raw sequencing data would go through quality control and genome alignment to find the mapped reads. Here the data will be stored in BAM files and be ready for peak calling. Why we need to run peak calling is that the mapped reads are through the whole genome, which could be quite large and complicated. Research would focus on the useful data, that is regions containing biological meanings. That’s why we will see where the mapping fragments are enriched, which could be corresbonding to the biological signals. Finally, when we have the peaks results ready, we can do downstream analysis such as cell type annotation.
Here are some important things to be considered in peak calling.
The first one is the region type of peaks. Some of the peaks are narrow and sharp, others might be broad and blunt, thus to recognize these variant regions, many methods are used to accomendate for the data.
Here we show many algorithms and methods used for peak calling. Here are count-based method, shape-based method, markov model method and deep learning models. Currently most used methods are count-based, and as in our company the pipeline uses MACS2&3, and SICER, we will focus on these methods to discuss. Deep learning method is a new application to biological results. Here in the model called lanceOtron, they used both poisson distribution and shape pattern to double-check the quality and quantity of peaks.
As we do not want to see false peaks being called, the background noise and control data set are tow good aspects to help eliminate the issue. ChIP-seq experiments often generate noise due to non-specific binding of antibodies and other factors. This noise can lead to false positive peak calls. Control datasets (often referred to as IgG controls) represent the DNA sample without the specific antibody enrichment. They reflect the background noise level. By comparing ChIP-seq data with control data, peak calling algorithms can differentiate true binding events from noise. Peaks that are significantly enriched in ChIP but not in the control are more likely to be biologically relevant. The control dataset helps in estimating the baseline noise level and setting appropriate statistical thresholds for peak calling.
As we need to run for a batch of data, I choose to use bash file and snakemake file for running.
#dependecies in this part
pip install pandas
pip install matplotlib
pip install pybedtools
pip install seaborn
For macs2, we set a bash code for both narrow mode and broad mode
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_h3k27me3"
"k562_1_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs2/${file}_noControl"
# Build the full command for each file
full_command="macs2 callpeak -f BAMPE -g hs -c data/uniq_hits/k562_1_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/runningDataFinal/${file}_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_h3k27me3"
"k562_1_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs2/${file}_noControl"
# Build the full command for each file
full_command="macs2 callpeak -f BAMPE --broad --broad-cutoff 0.1 -g hs -c data/uniq_hits/k562_1_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/runningDataFinal/${file}_broad_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
"k562_2_h3k27me3"
"k562_2_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs2/${file}_noControl"
# Build the full command for each file
full_command="macs2 callpeak -f BAMPE -g hs -c data/uniq_hits/k562_2_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/runningDataFinal/${file}_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
"k562_2_h3k27me3"
"k562_2_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs2/${file}_noControl"
# Build the full command for each file
full_command="macs2 callpeak -f BAMPE --broad --broad-cutoff 0.1 -g hs -c data/uniq_hits/k562_2_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/runningDataFinal/${file}_broad_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
Note here we use hs as the genome, which is by default hg19, and paired-end mode for data. For broad mode, we set broad cutoff as q-value = 0.1 for broader peak detect.
For macs3, the code is generally the same.
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_h3k27me3"
"k562_1_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs3/${file}_noControl"
# Build the full command for each file
full_command="macs3 callpeak -f BAMPE -g hs -c data/uniq_hits/k562_1_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/runningDataFinal/${file}_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_h3k27me3"
"k562_1_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs3/${file}_noControl"
# Build the full command for each file
full_command="macs3 callpeak -f BAMPE --broad --broad-cutoff 0.1 -g hs -c data/uniq_hits/k562_1_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/runningDataFinal/${file}_broad_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
"k562_2_h3k27me3"
"k562_2_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs3/${file}_noControl"
# Build the full command for each file
full_command="macs3 callpeak -f BAMPE -g hs -c data/uniq_hits/k562_2_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/runningDataFinal/${file}_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
"k562_2_h3k27me3"
"k562_2_ctcf"
)
for file in "${file_list[@]}"; do
# mkdir "data/macs3/${file}_noControl"
# Build the full command for each file
full_command="macs3 callpeak -f BAMPE --broad --broad-cutoff 0.1 -g hs -c data/uniq_hits/k562_2_igg_uniq_hits.bam -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/runningDataFinal/${file}_broad_pe"
# Execute the command
echo "Running command for file: ${file}"
$full_command
done
Note that although the code is same and source code of macs3 is only an update version of macs2, they still gives different results in some situations, such as when background noise is extremly high.
For Epic2, we need to build index for bam files before running. Here we use the ChIP-seq and ATAC-seq data as example.ATAC-seq shows how to run for data without control input.
samtools index data/newForChip/H3K27ac.bam
samtools index data/newForChip/H3K27ac_control.bam
epic2 -t data/newForChip/H3K27ac.bam -c data/newForChip/H3K27ac_control.bam --genome hg38 > data/newForChip/results/epic2/epic2_h3k27ac
samtools index data/newForChip/H3K4me3.bam
samtools index data/newForChip/H3K4me3_control.bam
epic2 -t data/newForChip/H3K4me3.bam -c data/newForChip/H3K4me3_control.bam --genome hg38 > data/newForChip/results/epic2/epic2_h3k4me3
samtools index data/newForChip/CTCF.bam
samtools index data/newForChip/CTCF_control.bam
epic2 -t data/newForChip/CTCF.bam -c data/newForChip/CTCF_control.bam --genome hg38 > data/newForChip/results/epic2/epic2_ctcf
samtools index data/newForChip/ATACseq.bam
epic2 -t data/newForChip/ATACseq.bam --genome hg38 > data/newForChip/results/epic2/epic2_ATAC
Last peak caller is LanceOtron, the deep learning method.The first step is to sort and build index for bam files. Then the deep learning model requires bigwig file as input, so we use bamCoverage tool to build bw files. We use RPKM for normalizing. Then we can select whether to use control input or not.
samtools sort -o data/lanceotron/k562_2_igg_sorted.bam data/lanceotron/k562_2_igg.bam
samtools sort -o data/lanceotron/k562_2_h3k4me3_sorted.bam data/lanceotron/k562_2_h3k4me3_igg.bam
#build index
samtools index data/lanceotron/k562_2_igg_sorted.bam
samtools index data/lanceotron/k562_2_h3k4me3_sorted.bam
bamCoverage --bam data/uniq_hits/k562_1_ctcf_uniq_hits.bam -o data/lanceotron/k562_1_ctcf/1_ctcf.bw --extendReads -bs 1 --normalizeUsing RPKM
bamCoverage --bam data/uniq_hits/k562_1_igg_uniq_hits.bam -o data/lanceotron/k562_1_igg/1_igg.bw --extendReads -bs 1 --normalizeUsing RPKM
file_list=(
"k562_2_h3k4me2"
"k562_2_h3k4me1"
"k562_2_h3k27ac"
"k562_2_ctcf"
)
input_igg="data/lanceotron/k562_2_igg/k562_2_igg.bw"
for file_name in "${file_list[@]}"; do
lanceotron callPeaksInput "data/lanceotron/${file_name}/${file_name}.bw" -i "${input_igg}" -f "data/lanceotron/${file_name}/"
done
In this part we used both macs2 and macs3 for their default narrow mode and broad mode. As in MACS documentation they said the model would choose the maximum of lambda 1000, 5000, 10000 and background, we tried a lambda range from 1000 to 10000, background and default to see.
Here’s the code for macs2
#!/bin/bash
# Define the command and common options
command="macs2 callpeak -c data/uniq_hits/k562_1_igg_uniq_hits.bam -f BAMPE --broad --broad-cutoff 0.1 -g hs"
# Define the lambda values
lambda_values=(
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
)
# Define the file list
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_ctcf"
"k562_1_h3k27me3"
)
# Iterate over the file list
for file in "${file_list[@]}"; do
# Iterate over the lambda values
for lambda in "${lambda_values[@]}"; do
# Build the full command with lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/${file}/${file}_broad_lambda${lambda} --llocal $lambda"
# Execute the command
echo "Running command with lambda = $lambda for $file"
$full_command
done
# Build the full command without lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/${file}/${file}_broad_nolambda --nolambda"
# Execute the command
echo "Running command without lambda for $file"
$full_command
done
# Define the command and common options
command="macs2 callpeak -c data/uniq_hits/k562_2_igg_uniq_hits.bam -f BAMPE -g hs"
# Define the lambda values
lambda_values=(
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
)
# Define the file list
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
# "k562_1_h3k27me3"
"k562_2_ctcf"
)
# Iterate over the file list
for file in "${file_list[@]}"; do
# Iterate over the lambda values
for lambda in "${lambda_values[@]}"; do
# Build the full command with lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/${file}/${file}_lambda${lambda} --llocal $lambda"
# Execute the command
echo "Running command with lambda = $lambda for $file"
$full_command
done
# Build the full command without lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs2/${file}/${file}_nolambda --nolambda"
# Execute the command
echo "Running command without lambda for $file"
$full_command
done
Here’s the code for macs3
#!/bin/bash
# Define the command and common options
command="macs3 callpeak -c data/uniq_hits/k562_1_igg_uniq_hits.bam -f BAMPE --broad --broad-cutoff 0.1 -g hs"
# Define the lambda values
lambda_values=(
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
)
# Define the file list
file_list=(
"k562_1_h3k4me1"
"k562_1_h3k4me2"
"k562_1_h3k4me3"
"k562_1_h3k27ac"
"k562_1_ctcf"
"k562_1_h3k27me3"
)
# Iterate over the file list
for file in "${file_list[@]}"; do
# Iterate over the lambda values
for lambda in "${lambda_values[@]}"; do
# Build the full command with lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/${file}/${file}_broad_lambda${lambda} --llocal $lambda"
# Execute the command
echo "Running command with lambda = $lambda for $file"
$full_command
done
# Build the full command without lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/${file}/${file}_broad_nolambda --nolambda"
# Execute the command
echo "Running command without lambda for $file"
$full_command
done
# Define the command and common options
command="macs3 callpeak -c data/uniq_hits/k562_2_igg_uniq_hits.bam -f BAMPE -g hs"
# Define the lambda values
lambda_values=(
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
)
# Define the file list
file_list=(
"k562_2_h3k4me1"
"k562_2_h3k4me2"
"k562_2_h3k4me3"
"k562_2_h3k27ac"
# "k562_1_h3k27me3"
"k562_2_ctcf"
)
# Iterate over the file list
for file in "${file_list[@]}"; do
# Iterate over the lambda values
for lambda in "${lambda_values[@]}"; do
# Build the full command with lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/${file}/${file}_lambda${lambda} --llocal $lambda"
# Execute the command
echo "Running command with lambda = $lambda for $file"
$full_command
done
# Build the full command without lambda for k562_1_h3k4me3
full_command="$command -t data/uniq_hits/${file}_uniq_hits.bam -n data/macs3/${file}/${file}_nolambda --nolambda"
# Execute the command
echo "Running command without lambda for $file"
$full_command
done
after running peak callers, we would like to see the results in different lambda choice. Here we only take H3K4me1 code as example.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
#----------------------------------------------h3k4me1------------------------------------
# Load the data for MACS2
macs2_lambda_default <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_pe_peaks.xls", header = TRUE))
macs2_lambda_no <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_nolambda_peaks.xls", header = TRUE))
macs2_lambda_1000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda1000_peaks.xls", header = TRUE))
macs2_lambda_2000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda2000_peaks.xls", header = TRUE))
macs2_lambda_3000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda3000_peaks.xls", header = TRUE))
macs2_lambda_4000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda4000_peaks.xls", header = TRUE))
macs2_lambda_5000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda5000_peaks.xls", header = TRUE))
macs2_lambda_6000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda6000_peaks.xls", header = TRUE))
macs2_lambda_7000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda7000_peaks.xls", header = TRUE))
macs2_lambda_8000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda8000_peaks.xls", header = TRUE))
macs2_lambda_9000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda9000_peaks.xls", header = TRUE))
macs2_lambda_10000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_lambda10000_peaks.xls", header = TRUE))
# Load the data for MACS3
macs3_lambda_default <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_pe_peaks.xls", header = TRUE))
macs3_lambda_no <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_nolambda_peaks.xls", header = TRUE))
macs3_lambda_1000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda1000_peaks.xls", header = TRUE))
macs3_lambda_2000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda2000_peaks.xls", header = TRUE))
macs3_lambda_3000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda3000_peaks.xls", header = TRUE))
macs3_lambda_4000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda4000_peaks.xls", header = TRUE))
macs3_lambda_5000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda5000_peaks.xls", header = TRUE))
macs3_lambda_6000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda6000_peaks.xls", header = TRUE))
macs3_lambda_7000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda7000_peaks.xls", header = TRUE))
macs3_lambda_8000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda8000_peaks.xls", header = TRUE))
macs3_lambda_9000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda9000_peaks.xls", header = TRUE))
macs3_lambda_10000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_lambda10000_peaks.xls", header = TRUE))
# Load the data for MACS2
macs2_lambda_broad_default <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_pe_peaks.xls", header = TRUE))
macs2_lambda_broad_no <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_nolambda_peaks.xls", header = TRUE))
macs2_lambda_broad_1000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda1000_peaks.xls", header = TRUE))
macs2_lambda_broad_2000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda2000_peaks.xls", header = TRUE))
macs2_lambda_broad_3000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda3000_peaks.xls", header = TRUE))
macs2_lambda_broad_4000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda4000_peaks.xls", header = TRUE))
macs2_lambda_broad_5000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda5000_peaks.xls", header = TRUE))
macs2_lambda_broad_6000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda6000_peaks.xls", header = TRUE))
macs2_lambda_broad_7000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda7000_peaks.xls", header = TRUE))
macs2_lambda_broad_8000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda8000_peaks.xls", header = TRUE))
macs2_lambda_broad_9000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda9000_peaks.xls", header = TRUE))
macs2_lambda_broad_10000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs2/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda10000_peaks.xls", header = TRUE))
# Load the data for MACS3
macs3_lambda_broad_default <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_pe_peaks.xls", header = TRUE))
macs3_lambda_broad_no <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_nolambda_peaks.xls", header = TRUE))
macs3_lambda_broad_1000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda1000_peaks.xls", header = TRUE))
macs3_lambda_broad_2000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda2000_peaks.xls", header = TRUE))
macs3_lambda_broad_3000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda3000_peaks.xls", header = TRUE))
macs3_lambda_broad_4000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda4000_peaks.xls", header = TRUE))
macs3_lambda_broad_5000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda5000_peaks.xls", header = TRUE))
macs3_lambda_broad_6000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda6000_peaks.xls", header = TRUE))
macs3_lambda_broad_7000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda7000_peaks.xls", header = TRUE))
macs3_lambda_broad_8000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda8000_peaks.xls", header = TRUE))
macs3_lambda_broad_9000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda9000_peaks.xls", header = TRUE))
macs3_lambda_broad_10000 <- nrow(read.table("D:/peakCalling/gopeaks-compare-main/data/macs3/k562_1_h3k4me1/k562_1_h3k4me1_broad_lambda10000_peaks.xls", header = TRUE))
# Create data frames with the number of rows in each file for MACS2 and MACS3
df_macs2 <- data.frame(Method = "MACS2",
Lambda = c("default", 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 9999, "bg"),
Rows = c(macs2_lambda_default, macs2_lambda_1000, macs2_lambda_2000, macs2_lambda_3000,
macs2_lambda_4000, macs2_lambda_5000, macs2_lambda_6000,
macs2_lambda_7000, macs2_lambda_8000, macs2_lambda_9000,
macs2_lambda_10000, macs2_lambda_no))
df_macs3 <- data.frame(Method = "MACS3",
Lambda = c("default",1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 9999, "bg"),
Rows = c(macs3_lambda_default,macs3_lambda_1000, macs3_lambda_2000, macs3_lambda_3000, macs3_lambda_4000,
macs3_lambda_5000, macs3_lambda_6000, macs3_lambda_7000, macs3_lambda_8000,
macs3_lambda_9000, macs3_lambda_10000, macs3_lambda_no))
df_macs2_broad <- data.frame(Method = "MACS2 broad",
Lambda = c("default", 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 9999, "bg"),
Rows = c(macs2_lambda_broad_default, macs2_lambda_broad_1000, macs2_lambda_broad_2000, macs2_lambda_broad_3000,
macs2_lambda_broad_4000, macs2_lambda_broad_5000, macs2_lambda_broad_6000,
macs2_lambda_broad_7000, macs2_lambda_broad_8000, macs2_lambda_broad_9000,
macs2_lambda_broad_10000,macs2_lambda_broad_no))
df_macs3_broad <- data.frame(Method = "MACS3 broad",
Lambda = c("default",1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 9999, "bg"),
Rows = c(macs3_lambda_broad_default, macs3_lambda_broad_1000, macs3_lambda_broad_2000, macs3_lambda_broad_3000, macs3_lambda_broad_4000,
macs3_lambda_broad_5000, macs3_lambda_broad_6000, macs3_lambda_broad_7000, macs3_lambda_broad_8000,
macs3_lambda_broad_9000, macs3_lambda_broad_10000, macs3_lambda_broad_no))
# Combine the data frames
combined_df <- rbind(df_macs2, df_macs3, df_macs2_broad, df_macs3_broad)
# Create a barplot with a title and colored bars, using different colors for each method
# Your existing ggplot code
ggplot(combined_df, aes(x = factor(Lambda), y = Rows, fill = Method)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Lambda", y = "Number of peaks found", title = "") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = Rows), position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
ggtitle("Lambda Choices Comparison in h3k4me1: MACS2 and MACS3")
In this part we use the ENCODE ChIP-seq data as ground truth. We use bedtools for intersect check and calculate for true positive, true negative, false positive and false negative. We use bash file for the run.
Here we only take H3K4me1 as example
# H3K4me1
calculate_metrics() {
method_name="$1"
your_peak_calling_result="$2"
encode_chipseq="src/standards/ENCFF759NWD.bed"
# Calculate True Positives (TP)
TP=$(bedtools intersect -u -a "$your_peak_calling_result" -b "$encode_chipseq" | wc -l)
# Calculate False Positives (FP)
FP=$(bedtools intersect -v -a "$your_peak_calling_result" -b "$encode_chipseq" | wc -l)
# Calculate False Negatives (FN)
FN=$(bedtools intersect -v -a "$encode_chipseq" -b "$your_peak_calling_result" | wc -l)
# Calculate True Negatives (TN)
TN=$(bedtools intersect -u -a "$encode_chipseq" -b "$your_peak_calling_result" | wc -l)
# total_genome_regions=$(wc -l < "$encode_chipseq")
# TN=$((total_genome_regions - TP - FP - FN))
# TN=$(bedtools intersect -u -a "$encode_chipseq" -b "$your_peak_calling_result" | wc -l)
# Calculate Precision (Positive Predictive Value)
precision=$(awk "BEGIN {print $TP / ($TP + $FP)}")
# Calculate Recall (Sensitivity or True Positive Rate)
recall=$(awk "BEGIN {print $TP / ($TP + $FN)}")
#calculate selectivity
Selectivity=$(awk "BEGIN {print $TN / ($FP + $TN)}")
# Calculate Accuracy
Accuracy=$(awk "BEGIN {print ($TN + $TP) / ($TP + $FN + $FP + $TN)}")
# Calculate F1 Score
F1_score=$(awk "BEGIN {print (2 * $TP) / (2 * $TP + $FP + $FN)}")
# Return the results as a string
echo -e "$method_name\t$TP\t$FP\t$FN\t$TN\t$precision\t$recall\t$F1_score\t$Accuracy\t$Selectivity"
}
# Header for the xls file
echo -e "Method\tTrue Positives (TP)\tFalse Positives (FP)\tFalse Negatives (FN)\tTrue Negatives (TN)\tPrecision\tRecall\tF1 Score\tAcurracy\tSelectivity" > "data/bar/CnR_h3k4me1_metrics.xls"
# Calculate metrics and append to the xls file for each method
calculate_metrics "lanceotron" "data/lanceotron/runningData/k562_h3k4me1_peaks.xls" >> "data/bar/CnR_h3k4me1_metrics.xls"
calculate_metrics "macs2" "data/macs2/runningData/k562_h3k4me1_peak.narrowPeak" >> "data/bar/CnR_h3k4me1_metrics.xls"
calculate_metrics "macs2broad" "data/macs2/runningDataBroad/k562_h3k4me1_broad_peak.broadPeak" >> "data/bar/CnR_h3k4me1_metrics.xls"
calculate_metrics "macs3" "data/macs3/runningData/k562_h3k4me1_peak.narrowPeak" >> "data/bar/CnR_h3k4me1_metrics.xls"
calculate_metrics "macs3broad" "data/macs3/runningDataBroad/k562_h3k4me1_broad_peak.broadPeak" >> "data/bar/CnR_h3k4me1_metrics.xls"
calculate_metrics "epic2" "data/epic2/runningData/k562_h3k4me1_peaks.narrowPeak" >> "data/bar/CnR_h3k4me1_metrics.xls"
## Method True.Positives..TP. False.Positives..FP. False.Negatives..FN.
## 1 lanceotron 189584 293959 12759
## 2 macs2 9054 251 108567
## 3 macs2broad 10699 293 105844
## 4 macs3 9054 251 108567
## 5 macs3broad 10699 293 105844
## 6 epic2 72700 26491 24137
## True.Negatives..TN. Precision Recall F1.Score Acurracy X X.1 X.2
## 1 103648 0.392073 0.9369440 0.552815 0.488761 0.5926482 6 1
## 2 7840 0.973025 0.0769761 0.142666 0.134387 0.3317635 3 5
## 3 10563 0.973344 0.0918030 0.167781 0.166893 0.3499552 1 3
## 4 7840 0.973025 0.0769761 0.142666 0.134387 0.3317635 3 5
## 5 10563 0.973344 0.0918030 0.167781 0.166893 0.3499552 1 3
## 6 92270 0.732929 0.7507460 0.741731 0.765174 0.7476450 5 2
## X.3 X.4 X.5 X.6 X.7
## 1 2 2 1 2.4 2
## 2 5 5 5 4.6 5
## 3 3 3 3 2.6 3
## 4 5 5 5 4.6 5
## 5 3 3 3 2.6 3
## 6 1 1 2 2.2 1
We use barplot here for model evaluation
library(ggplot2)
h3k4me1Comp <- read.table("D:/peakCalling/gopeaks-compare-main/data/bar/CnR_h3k4me1_metrics.xls", header = T, sep = "\t")
h3k4me1Comp <- h3k4me1Comp[, c(1,6:9)]
h3k4me1Comp <- reshape2::melt(h3k4me1Comp, id.vars="Method")
ggplot(h3k4me1Comp, aes(x = Method, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Barplot of Cut&Run h3k4me1: Precision ,Recall, F1 score, Accuracy",
x = "Method",
y = "Score",
fill = "Score") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 12, face = "bold"), # Adjust font size and make it bold for x-axis labels
axis.text.y = element_text(size = 12, face = "bold")) # Adjust font size and make it bold for y-axis labels
We use snakemake first to calculate for the ROC, then use python code for plotting
rule evaluate_macs2:
input:
"data/macs2/runningData/{sample}_peak.xls"
output:
"data/evaluate_models/macs2_{sample}.txt"
conda:
"../envs/eval.yml"
# standard=get_standard
threads: 4
resources:
mem_mb = 2000
shell:
"bash src/custom/evals/evaluate_macs2.sh -i {wildcards.sample} -s src/standards/ENCFF759NWD.bed -t {threads} > {output}"
rule evaluate_macs3:
input:
"data/macs3/runningData/{sample}_peak.xls"
output:
"data/evaluate_models/macs3_{sample}.txt"
conda:
"../envs/eval.yml"
# standard=get_standard
threads: 4
resources:
mem_mb = 2000
shell:
"bash src/custom/evals/evaluate_macs3.sh -i {wildcards.sample} -s src/standards/ENCFF759NWD.bed -t {threads} > {output}"
rule evaluate_epic2:
input:
"data/epic2/runningData/{sample}_peak.txt"
output:
"data/evaluate_models/epic2_{sample}.txt"
conda:
"../envs/eval.yml"
threads: 4
resources:
mem_mb = 2000
shell:
"bash src/custom/evals/evaluate_epic2.sh -i {wildcards.sample} -s src/standards/ENCFF759NWD.bed -t {threads} > {output}"
rule evaluate_lanceOtron:
input:
"data/lanceotron/runningData/{sample}_peaks.xls"
output:
"data/evaluate_models/lanceOtron_{sample}.txt"
conda:
"../envs/eval.yml"
# standard=get_standard
threads: 4
resources:
mem_mb = 2000
shell:
"bash src/custom/evals/evaluate_otron.sh -i {wildcards.sample} -s src/standards/ENCFF759NWD.bed -t {threads} > {output}"
The evaluate.sh files are as this. Here take macs2 as example. Real situation depends on methods and files they generated.
#!/bin/bash
while getopts "i:s:t:" op
do
case "$op" in
i) sample="$OPTARG";;
s) standard="$OPTARG";;
t) threads="$OPTARG";;
\?) exit 1;;
esac
done
# variables and functions ---------------------------------------------------------------
calculate_rate() {
awk -v true_term=$1 -v false_term=$2 'BEGIN {print true_term / (true_term + false_term)}' | cut -b1-5
}
calculate_f1_score() {
awk -v true_term=$1 -v false_term=$2 'BEGIN {print 2 * (true_term * false_term) / (true_term + false_term)}' | cut -b1-5
}
threshold() {
counts=$1
predicted_truth=$(cat $sample_file_handle | awk -v s=$1 '$11 <= s')
predicted_false=$(cat $sample_file_handle | awk -v s=$1 '$11 > s')
TP=$(echo "$predicted_truth" | cut -f1-3 | bedtools intersect -u -a - -b $standard | wc -l)
FP=$(echo "$predicted_truth" | cut -f1-3 | bedtools intersect -v -a - -b $standard | wc -l)
FN=$(echo "$predicted_false" | cut -f1-3 | bedtools intersect -u -a - -b $standard | wc -l)
TN=$(echo "$predicted_false" | cut -f1-3 | bedtools intersect -v -a - -b $standard | wc -l)
# calculate precision, recall, F1. export results.
precision=$(calculate_rate $TP $FP)
recall=$(calculate_rate $TP $FN)
fpr=$(calculate_rate $FP $TN)
f1=$(calculate_f1_score $precision $recall)
echo -e "$method\t$condition\t$replicate\t$mark\t$counts\t$TP\t$FP\t$FN\t$TN\t$precision\t$recall\t$fpr\t$f1"
}
# file I/O ------------------------------------------------------------------------------
file=$(find data/macs2/runningData -name "${sample}*Peak") # this workaround accommodates for narrowPeak and broadPeak inputs
# identify method,condition,replicate,mark
method="macs2"
condition=$(basename $file | cut -d_ -f1)
replicate=$(basename $file | cut -d_ -f2)
mark=$(basename $file | cut -d_ -f3)
echo -e "method\tcondition\treplicate\tmark\tsignal\tTP\tFP\tFN\tTN\tprecision\trecall\tfpr\tf1" # header
# file prep -----------------------------------------------------------------------------
# create a new column with pval in decimal form. Sort by that new column.
# identify all pvals within a sample.
sample_file=$(awk -v OFS='\t' '{print $0,10**-$9}' $file | grep -vi "inf" | sort -rg -k11)
sample_pvals=$(echo "$sample_file" | cut -f11 | uniq)
sample_file_handle="data/evaluate_models/macs2_${sample}.tmp.txt" # create tmp file to threshold from
echo "$sample_file" > $sample_file_handle
# process peaks sequentially
for pval in $sample_pvals; do
threshold $pval
done
rm $sample_file_handle
for the output files, we can use python sklearn module for plotting.
from sklearn.metrics import roc_curve, auc
import pandas as pd
import matplotlib.pyplot as plt
#for h3k4me1
epic2_data_h3k4me1 = pd.read_csv("D:/peakCalling/gopeaks-compare-main/data/evaluate_models/epic2_k562_h3k4me1.txt", delimiter='\t')
macs3_pe_h3k4me1 = pd.read_csv("D:/peakCalling/gopeaks-compare-main/data/evaluate_models/macs3_k562_h3k4me1.txt", delimiter='\t')
macs2_pe_h3k4me1 = pd.read_csv("D:/peakCalling/gopeaks-compare-main/data/evaluate_models/macs3_k562_h3k4me1.txt", delimiter='\t')
macs3_peb_h3k4me1 = pd.read_csv("D:/peakCalling/gopeaks-compare-main/data/evaluate_models/macs3_k562_h3k4me1_broad.txt", delimiter='\t')
macs2_peb_h3k4me1 = pd.read_csv("D:/peakCalling/gopeaks-compare-main/data/evaluate_models/macs2_k562_h3k4me1_broad.txt", delimiter='\t')
# Sort the data by increasing signal values
epic2_data_h3k4me1.sort_values('signal', inplace=True)
macs3_pe_h3k4me1.sort_values('signal', inplace=True)
macs2_pe_h3k4me1.sort_values('signal', inplace=True)
macs3_peb_h3k4me1.sort_values('signal', inplace=True)
macs2_peb_h3k4me1.sort_values('signal', inplace=True)
# Calculate True Positive Rate (Recall)
tpr_epic2_data_h3k4me1 = epic2_data_h3k4me1['recall']
tpr_macs3_pe_h3k4me1 = macs3_pe_h3k4me1['recall']
tpr_macs2_pe_h3k4me1 = macs2_pe_h3k4me1['recall']
tpr_macs3_peb_h3k4me1 = macs3_peb_h3k4me1['recall']
tpr_macs2_peb_h3k4me1 = macs2_peb_h3k4me1['recall']
# Calculate False Positive Rate
fpr_epic2_data_h3k4me1 = epic2_data_h3k4me1['fpr']
fpr_macs3_pe_h3k4me1 = macs3_pe_h3k4me1['fpr']
fpr_macs2_pe_h3k4me1 = macs2_pe_h3k4me1['fpr']
fpr_macs3_peb_h3k4me1 = macs3_peb_h3k4me1['fpr']
fpr_macs2_peb_h3k4me1 = macs2_peb_h3k4me1['fpr']
# precision
pre_epic2_data_h3k4me1 = epic2_data_h3k4me1['precision']
pre_macs3_pe_h3k4me1 = macs3_pe_h3k4me1['precision']
pre_macs2_pe_h3k4me1 = macs2_pe_h3k4me1['precision']
pre_macs3_peb_h3k4me1 = macs3_peb_h3k4me1['precision']
pre_macs2_peb_h3k4me1 = macs2_peb_h3k4me1['precision']
# Calculate Area Under the Curve (AUC)
auc_epic2_data_h3k4me1 = auc(fpr_epic2_data_h3k4me1, tpr_epic2_data_h3k4me1)
auc_macs3_pe_h3k4me1 = auc(fpr_macs3_pe_h3k4me1, tpr_macs3_pe_h3k4me1)
auc_macs2_pe_h3k4me1 = auc(fpr_macs2_pe_h3k4me1, tpr_macs2_pe_h3k4me1)
auc_macs3_peb_h3k4me1 = auc(fpr_macs3_peb_h3k4me1, tpr_macs3_peb_h3k4me1)
auc_macs2_peb_h3k4me1 = auc(fpr_macs2_peb_h3k4me1, tpr_macs2_peb_h3k4me1)
# Set the background color to gray
plt.rcParams['axes.facecolor'] = 'lightgray'
# Plot the ROC curve
plt.figure(figsize=(10, 6))
# Plot EPIC2 ROC curve
plt.plot(fpr_epic2_data_h3k4me1, tpr_epic2_data_h3k4me1, label=f"EPIC2 (AUC = {auc_epic2_data_h3k4me1:.2f})", linewidth=2, color='green')
# Plot MACS2 ROC curve
plt.plot(fpr_macs3_pe_h3k4me1, tpr_macs3_pe_h3k4me1, label=f"MACS3 (AUC = {auc_macs3_pe_h3k4me1:.2f})", linewidth=2, color='c')
# Plot MACS3 ROC curve
plt.plot(fpr_macs2_pe_h3k4me1, tpr_macs2_pe_h3k4me1, label=f"MACS2 (AUC = {auc_macs2_pe_h3k4me1:.2f})", linewidth=2, color='m')
# Plot MACS2 ROC curve
plt.plot(fpr_macs3_peb_h3k4me1, tpr_macs3_peb_h3k4me1, label=f"MACS3 broad(AUC = {auc_macs3_peb_h3k4me1:.2f})", linewidth=2, color='y')
# Plot MACS3 ROC curve
plt.plot(fpr_macs2_peb_h3k4me1, tpr_macs2_peb_h3k4me1, label=f"MACS2 broad(AUC = {auc_macs2_peb_h3k4me1:.2f})", linewidth=2, color='brown')
# Plot EPIC2 ROC curve
plt.plot(fpr_epic2_data_h3k4me1, tpr_epic2_data_h3k4me1, linewidth=2, color='green')
# Plot MACS2 ROC curve
plt.plot(fpr_macs3_pe_h3k4me1, tpr_macs3_pe_h3k4me1, linewidth=2, color='c')
# Plot MACS3 ROC curve
plt.plot(fpr_macs2_pe_h3k4me1, tpr_macs2_pe_h3k4me1, linewidth=2, color='m')
# Plot MACS2 ROC curve
plt.plot(fpr_macs3_peb_h3k4me1, tpr_macs3_peb_h3k4me1, linewidth=2, color='y')
# Plot MACS3 ROC curve
plt.plot(fpr_macs2_peb_h3k4me1, tpr_macs2_peb_h3k4me1, linewidth=2, color='brown')
# Plot No Skill ROC curve
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='No Skill')
# Set plot labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC curve for h3k4me1')
plt.legend(loc='lower right')
# Show the plot
plt.show()
# Set the background color to gray
plt.rcParams['axes.facecolor'] = 'lightgray'
# Plot the Precision-Recall curve
plt.figure(figsize=(10, 6))
# Plot EPIC2 Precision-Recall curve
plt.plot(tpr_epic2_data_h3k4me1, pre_epic2_data_h3k4me1, label=f"EPIC2 PR", linewidth=2, color='green')
# Plot MACS3 Precision-Recall curve
plt.plot(tpr_macs3_pe_h3k4me1, pre_macs3_pe_h3k4me1, label=f"MACS3 PR", linewidth=2, color='c')
# Plot MACS2 Precision-Recall curve
plt.plot(tpr_macs2_pe_h3k4me1, pre_macs2_pe_h3k4me1, label=f"MACS2 PR", linewidth=2, color='m')
# Plot MACS3 Precision-Recall curve
plt.plot(tpr_macs3_peb_h3k4me1, pre_macs3_peb_h3k4me1, label=f"MACS3 broad PR", linewidth=2, color='y')
# Plot MACS2 Precision-Recall curve
plt.plot(tpr_macs2_peb_h3k4me1, pre_macs2_peb_h3k4me1, label=f"MACS2 broad PR", linewidth=2, color='brown')
# Set plot labels and title
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve for h3k4me1')
plt.legend(loc='lower left')
# Show the plot
plt.show()
When studying MACS2 and MACS3 algorithm, we noticed in its design they use a dynamic lambda to guarantee the precision of peak detection. This is specially designed in MACS algorithm to make it distinguished from other peak callers. The lambda represents for the noise level around the peak. The MACS algorithm will detect the maximum noise of whole genome background, the noise level around 1,000 bp neighborhood, the noise level around 5,000 bp neighborhood and the noise level around 10,000 bp neighborhood. Here we use lambda value from 1,000 to 10,000, increment by 1,000 each to see the peak calling results.
The conclusion is that default lambda value always calls least peaks, MACS2 and MACS3 both show good robustness against lambda choice.
For this part, we did three kinds of analysis for each histone mark or transcription factor. Curate benchmark datasets:Cut&Tag, ChIP-seq, ATAC-seq Run peak callers for all the datasets Evaluate/compare peak calling tools: Number of peaks Accuracy (correctly predicted proportion) Precision (ability to distinguish positive and negative) Recall (correctly predicted for positive) F1 score (false predicted level) AUC/ROC (show predictive power) IGV plots (peak patterns) With control vs without control Results interpretation & Visualization
Here’s a workflow chart for result analysis part.
Next, we will show the results in Cut&Tag, ChIP-seq and ATAC-seq
H3K4me1
This histone modification we only have the Cut&Tag data
First we see the peak number result
Peak numbers: LanceOtron > epic2 > macs2&3 broad mode > macs2&3 With control vs without control: Has a significant influence on macs2&macs3; increase in epic2 peaks; no much affect on LanceOtron
Epic2 achieves the best performance, but the precision is lower than other methods because of its false positive detection If no control input used, the overall performance of model would improve, but precision would be lower Performance of MACS2&MACS3 broad mode is significantly influenced No much influence on deep learning model.
If no control input used, the overall performance of model would improve, but precision would be lower Performance of MACS2&MACS3 broad mode is significantly influenced No much influence on deep learning model.
Different modes of MACS2&MACS3, with or without control input. Overall, if control input is provided, models has better performance on precision. Even though control quality is bad, it helps in decrease wrongly classified peaks
IGV plot shows that the peak mode is blunt and broad Deep Learning model detects much more peaks EPIC2 tends to detect much broader peaks and merge small peaks.
H3K4me2
This histone modification we only have Cut&Tag data.
Peak numbers: LanceOtron > epic2 > macs2&3 With control vs without control: Has a significant influence on macs2&macs3; increase in epic2 peaks; no much affect on LanceOtron
If no control input used, the overall performance of model would improve, but precision would be lower Performance of MACS2&MACS3 is better, but still not better than EPIC2 No much influence on deep learning model.
IGV plot shows that it has sharp peaks but still contain broader peaks Deep Learning model detects much more peaks EPIC2 tends to detect much broader peaks and merge small peaks. The background noise confounded the peak recognition of macs
H3K4me3
This histone modification we have both Cut&Tag and ChIP-seq data, so we put them together to analyze
Peak numbers: LanceOtron > epic2 > macs2&3 With control vs without control: Has increase in epic2 and macs2&macs3; no much affect on LanceOtron All methods are robust in H3K4me3
Generally, the epic2 has the best performance among Cut&Tag and lanceotron do better on ChIP-seq data When no control input provided, MACS family shows better robustness
IGV plot shows that the peak mode is sharp and clear Deep Learning model detects much more peaks EPIC2 and MACS broad mode tends to detect much broader peaks and merge small peaks.
H3K27me3
This histone modification we only have Cut&Tag data, and it is a typically broad peak.
Peak numbers: LanceOtron > epic2 > macs2&3 With control vs without control: Increase on macs2&macs3 and epic2 peaks
EPIC2 shows overall the best performance If no control input used, the broad mode of macs2&3 performs better No much influence on EPIC2 and deep learning model.
IGV plot shows that the peak mode is much blunt and broad. The background noise confounded the peak recognition of MACS Deep Learning model detects much more peaks EPIC2 tends to detect much broader peaks and merge small peaks.
H3K27ac
This histone modification we have both Cut&Tag and ChIP-seq data, so we put them together to analyze
Peak numbers: LanceOtron > epic2 > macs2&3 in Cut&Tag; epic2 > LanceOtron > macs2&3 in ChIP-seq data With/without control: significant increase in macs2&macs3; no much affect on LanceOtron
Without control, the overall performance would increase significantly in Cut&Tag data but slightly decrease in ChIP-seq None of the methods performs well on mixed region type
IGV plot shows that the peak mode is combined of sharp and blunt
EPIC2 and MACS broad mode tends to detect much broader peaks and merge small peaks.
CTCF
This transcription factor we have both Cut&Tag and ChIP-seq data, so we put them together to analyze
Peak numbers: LanceOtron > epic2 > macs2&3 broad > macs2&3 With control vs without control: Has increase in epic2 and macs2&macs3; no much affect on LanceOtron
Overall, no model shows very satisfactory performance among Cut&Tag data. In ChIP-seq data, macs2&3 have the best performance, especially the narrow mode when no control input provided There’s no significant difference in broad mode and narrow mode of MACS in CTCF
IGV plot shows that the peak mode is sharp and clear Deep Learning model detects much more peaks EPIC2 detects for broader peaks
ATAC-seq data
ATAC-seq data is for detection of open sites in chromatin. The peak enrichment indicates for open regions
Note that ATAC seq does not have control input (most cases)
Peak numbers: LanceOtron > epic2 > macs2&3 broad mode > macs2&3
MACS2&3 have the best performance, especially in narrow mode. Only EPIC2 performance is poor, maybe because of ATAC-seq peak is sharp when showing the open sites
IGV plot shows that the peak mode is sharp and close to each other
General Conclusion for the results analysis For Cut&Tag data, epic2 often achieves the best performance
MACS2 and MACS3 have robust performance on ChIP-seq data, which makes them reliable choices
Deep learning model always achieves more peaks with lower precision
MACS2&3 sometimes cannot recognize blunt peaks, especially narrow in mode
Epic2 tends to detect for broader peaks and merge small peaks
Ranking system and suggestion table We do an aggregate ranking here to make suggestions for peak callers.
The first step is, for each factor, calculate rank for each method based on peak number, precision, recall, F1 score and accuracy. Rank ranges from 1 to 6 where 1 represents the best method and 6 is the worst method The second step is to calculate aggregate rank based on averaging the above 5 different ranks The third step is to calculate aggregate rank for each factor separately and summarize the final ranking in the table
Table for Cut&Tag We will give a comprehensive summary for each factor. For H3K4me1, EPIC2 has the best overall performance, then followed by deep learning model. MACS2&3 broad mode has better performance than MACS2&3 default mode. When no control dataset is provided, EPIC2 still remains the best, while macs2&3 broad mode gets better than the deep learning model.
For H3K4me2, EPIC2 has the best overall performance, then followed by MACS2&3 broad mode. MACS2&3 broad mode has better performance than MACS2&3 default mode. When no control dataset is provided, EPIC2 still remains the best, then MACS3 broad mode shows better application than macs2 broad mode.
For H3K4me3, EPIC2 has the best overall performance, then followed by MACS2&3 broad mode. MACS2&3 broad mode has better performance than MACS2&3 default mode. When no control dataset is provided, EPIC2 still remains the best, then MACS2 broad mode shows better application than macs3 broad mode.
For H3K27me3, EPIC2 has the best overall performance, then followed by deep learning model. MACS2&3 default mode has better performance than MACS2&3 broad mode. When no control dataset is provided, EPIC2 and deep learning model still remains the best, while macs2&3 broad mode gets better than the default mode.
For H3K27ac, MACS2&3 broad mode have the best overall performance, then followed by EPIC2. When no control dataset is provided, MACS2&3 broad mode still remains the best, then followed by their default mode.
For transcription factor CTCF, EPIC2 has the best overall performance, then followed by deep learning model. MACS2&3 broad mode has better performance than MACS2&3 default mode. When no control dataset is provided, macs2 broad mode turns out to be the best.
Table for ChIP-seq and ATAC-seq
In ChIP-seq and ATAC-seq data, as there are no significant performance change between with and without provided control input, we summarize all the results in one table.
For H3K4me3, MACS2 default mode has the best overall performance, then followed by MACS3 default mode. The deep learning model has third best performance while the EPIC2 becomes the worst method in such situation.
For H3K27ac, MACS3 default mode has the best overall performance, then followed by MACS2 default mode. The deep learning model has third best performance while the EPIC2 becomes the worst method in such situation.
For CTCF, MACS3 broad mode has the best overall performance, then followed by MACS2 broad mode. MACS2&3 overall has better performance than other methods.
For ATAC-seq data, MACS2&3 default mode has the best overall performance, then followed by MACS2&3 broad mode. MACS2&3 overall has better performance than other methods.
The general conclusion of this part is: The deep learning model performs better on ChIP-seq data
EPIC2 is designed for sparse and broader peaks, when background noise is low it will have much better performance
MACS2 and MACS3 are more sensitive to control quality
The deep learning model overfitting Overfitting is a common issue in deep learning models where the model learns the training data too well, including the noise and random fluctuations, and fails to generalize to new, unseen data. This can result in poor performance on new data and defeat the purpose of the model. If a deep learning model is used for peak calling and it overfits to the training data, it could result in false positive or false negative predictions, reducing the accuracy of the peak calling results.
To avoid overfitting in deep learning models used for peak calling, we can constrain the complexity of the model, and using techniques such as weight constraints to control overfitting.
We can see that the deep learning model shows oversensitivity due to overfitting, and the cost is lower precision.
broad and narrow mode of macs2&3 Algorithm difference: The main difference here is using the options –broad –broad-cutoff 0.1. With the option –broad on, MACS will try to composite broad by putting nearby highly enriched regions into a broad region with loose cutoff. The broad region is controlled by another cutoff through –broad-cutoff. MACS2 broad mode does a 2-level peak calling and embed stronger/narrower calls in weaker/broader calls. Assuming that all level 1 peaks should be inside level 2 peaks (this is an explicit assumption by MACS2), the algorithm groups level 1 peaks inside level 2 peaks to output a broad peak.
paired-end mode and signle-end mode In single-end sequencing, MACS uses 1000 enriched regions to model the distance d between the forward and reverse strand peaks.
In paired-end sequencing, the two ends of a DNA fragment are sequenced. The distance between the two reads represents the size of the original DNA fragment.
Single-end models the fragment lengths from the “single-end” R1 reads and then extends the read lengths to the average value from the model. If it is applied to paired-end data, half of the information would be disregarded. Treating paired-end data as single-end data increases the likelihood of identifying false peaks due to random noise.
Although using the paired-end mode of MACS2 reduces apparent false positive peaks, some false negatives are more likely to occur. In paired-end mode, the minimum length of called peaks is set to the average fragment length. Therefore, any region that achieves statistical significance but fails to meet this length requirement is automatically discarded.